NPS Inventory and Monitoring Division Intro to R Training: March 24 - 26, 2026

Prep for Training

Installing required software

The only prerequisite for the R training is to install the latest version of R and RStudio on your computer. These should be available in the Company Portal in Entra, and shouldn’t require special permissions to install. We’ll talk about the difference between R and RStudio on the first day, but for now, just make sure they’re installed.

Your R version should be 4.4.3 or later so that everyone’s code behaves the same way. The version you have is likely 4.5.1.
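If you’re not sure which version you have, one way to check from the R console is sketched below. Both calls are part of base R, and the 4.4.3 threshold is simply the minimum stated above.

```r
# Print the full version string of the R you are running
R.version.string

# Compare your version against the minimum for this training
# TRUE means your version is 4.4.3 or later
getRversion() >= "4.4.3"
```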

Required Packages

A number of packages are required to follow along with the data wrangling and visualization sessions. Please try to install these in RStudio ahead of time by running the code below. If you don’t know how to run the code, view the Running Code Screencast below to see how.

packages <- c("tidyverse", "ggthemes", "GGally", "RColorBrewer", 
              "viridis", "scales", "plotly", "patchwork")

install.packages(setdiff(packages, rownames(installed.packages())))  

# Check that installation worked
library(tidyverse) # turns on all tidyverse packages
library(ggthemes)
library(GGally)
library(RColorBrewer)
library(viridis)
library(scales)
library(plotly)
library(patchwork)
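If any of the library() calls above fail, a quick way to see which packages are still missing is sketched here. It uses only base R functions; the object name is_installed is just one I made up for this example.

```r
packages <- c("tidyverse", "ggthemes", "GGally", "RColorBrewer",
              "viridis", "scales", "plotly", "patchwork")

# TRUE for each package that R can find, FALSE for any that is missing
is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
is_installed

# Names of packages that still need to be installed (empty if all went well)
packages[!is_installed]
```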
Running Code Screencast


Optional Reading

This is completely optional, but if you have any time before training starts, I highly recommend reading Chapter 2: R basics and workflows in STAT545. This is an online book based on a graduate-level statistics class that Dr. Jenny Bryan teaches and is turning into a book. She’s a stats professor turned senior programmer at RStudio, and I really like how she approaches programming in R in this book.


Structure of training

Timing: The training will take place over 3 half days. Each day will run from 9 - 1 EST via MS Teams. I will hold office hours during the hour before and the afternoon following each session, in case there are questions that couldn’t be handled during training.

Structure: For most of the training, I will share my screen as I go through the website and then demo with live coding. Having 2 screens, one for my screen share and one for your R session, will make following along a lot easier.

Getting Help: I intentionally included people in this training I know to be kind and capable, and who will benefit from having better R skills. My hope is that this group is small enough and supportive enough that everyone feels comfortable openly asking questions and providing feedback to the group. However, if you aren’t comfortable asking questions openly, you can ask questions in the anonymous training feedback form. Additionally, if someone runs into an issue I can’t immediately troubleshoot (it happens), we may have to table it until office hours, and then discuss it with the group the next day (if relevant).

Objectives: Three days is barely enough to scratch the surface on what you can do in R. My goals with this training are to:
  1. Help you get beyond the initial learning curve that can be really challenging to climb on your own.
  2. Expose you to some of the really useful things R can do.
  3. Provide you the tools needed to continue advancing your R skills on your own.
Credit: This training borrows heavily from the IMD Intro to R training in 2022. A ton of credit for this training goes to the developers of those lessons:
  • Day 1 Intro to R: Sarah Wright and Andrew Birch
  • Day 2 Data Wrangling: John Paul Schmit and Lauren Pandori
  • Day 3 Data Visualization: Ellen Cheng and Kate Miller (Spatial Data)
  • Day 4 Programming Best Practices: Sarah Wright and Thomas Parr

Feedback: Finally, to help me improve this training for future sessions, please leave feedback in the training feedback form. You can submit feedback multiple times and don’t need to answer every question. Responses are anonymous.


Day 1: Intro to R

Day 1 Goals

Overall goals for the first day are:

  1. Get comfortable navigating RStudio IDE, such as opening a new project or script, running code and viewing the output, etc.
  2. Ability to import and save CSV and .xlsx files.
  3. Basic understanding of variables, functions, and data frames.
  4. Ability to explore data frames, such as the number of columns, min/max of different columns, data type of column, basic plotting, etc.
  5. Basic understanding of square brackets to view and filter data[,]. This is probably the hardest concept you’ll learn this week. The next two days I will show you easier ways to work with your data that don’t involve brackets.
  6. Able to access help within and outside of R.
R journey
Artwork by @allison_horst

Intro to R

Why I love R:

R welcoming illustration Artwork by @allison_horst

There are many reasons to use R. Some of my top reasons are below:
  • It’s free!
  • Thorough, helpful, and welcoming user community, including a ton of freely available online help and learning resources.
  • Large user community in NPS to collaborate and share code.
  • Language was designed by statisticians to facilitate data analysis and visualization.
    • Relatively shallow learning curve compared to other coding languages (e.g. Python).
    • Developers’ philosophy was to make it so you don’t have to know how to program to learn R. Then as you become more advanced and do more complicated tasks, learning how to program will benefit your work.
  • Code documents your workflow.
  • Code builds on code.
  • Automating tasks, like QA/QC, compiling/querying data, calculating summary statistics, and building dashboards has improved our data quality and made our data accessible and easy to work with for other users.

Other benefits of R:
  • Base R maintains backwards compatibility, so code written in base R should run regardless of R version.
    • The caveat is that packages are not guaranteed to be backwards compatible.
    • Python is not backwards compatible.
  • The tidyverse, which is a collection of really useful R packages, makes code more readable and consistently formatted. Tidyverse packages aren’t always backwards compatible, but they tend to be pretty stable and are super helpful for data wrangling and plotting.
  • RStudio can integrate with other coding languages, such as SQL, HTML/CSS, Python, and JavaScript.

Recipe for learning R

Everyone learns differently, but the ingredients I see that most ensure success are:
  • Community: Finding a group of other R users you can reach out to when you’re stuck or need feedback is invaluable. I was lucky to be part of a group that was learning R together. I still collaborate with many of these folks. I’m hoping you all see this group as your R community.
  • Persistence: Keep trying new things in R, even if you ultimately have to abandon attempts and go back to what you know, like Access or Excel. Persistence pays off.
  • Fearlessness: You have to be okay with failing the first, second, or tenth time to solve a problem, at least at first. As you get more comfortable, your success rate will improve. Throughout that entire process, you’re learning R.
  • Googling: Half of being a good coder is learning how to google what you’re trying to do, or the issue you’re having. At first, you may not find the answers you’re looking for, but by reading help pages, like StackOverflow, you’re learning to read code and seeing solutions that may help you in the future.
Debugging rollercoaster
Artwork by @allison_horst

AI soapbox

Why I don’t use AI to write code:
  • Behind the scenes, AI is taking answers from websites and other sources without crediting them.
  • There’s a huge environmental footprint to run the generative AI servers.
  • Research has shown that people who use AI to write for them lose their ability to write and think critically over time. Writing code isn’t that different. If you’re not actively writing the code you are using, your ability to debug and to verify that code is doing what you expect may be weakened.
  • To keep AI responses out of Google search results, include -ai in the Google search box.

R and RStudio

About R

R is a programming language that was originally developed by statisticians for statistical computing and graphics. R is free and open source. That means you will never need a paid license to use it, and you can view the underlying source code of any function and suggest fixes and improvements. Since its first official release in 1995, R has remained one of the leading programming languages for statistics and data visualization, and its capabilities continue to grow.

When you install R, it comes with a simple user interface that lets you write and execute code. However, writing code in this interface is similar to writing a report in Notepad: it’s simple and straightforward, but you likely need more features than Notepad has to format your document. This is where RStudio comes in.

For more information on the history of R, visit the R Project website.


About RStudio

RStudio is what’s called an integrated development environment (IDE), which is essentially a shell around the R program. RStudio makes programming in R easier by color coding different types of code, auto-completing code, flagging mistakes (like spellcheck for code), and providing many useful tools with the push of a button or key stroke (e.g. viewing help info).


RStudio Anatomy

When you open RStudio, you typically see 4 panes:
RStudio panes

Source

This is primarily where you write code. When you create a new script or open an existing one, it displays here. In the screenshot above, there’s a script called bat_data_wrangling.R open in the source pane. Note that if you haven’t yet opened or created a new script, you won’t see this pane until you do.

The source pane color-codes your code to make it easier to read, and detects syntax errors (the coding equivalent of a grammar checker) by flagging the line number with a red “x” and showing a squiggly line under the offending code.

When you’re ready to run all or part of your script:
  • Highlight the line(s) of code you want to run
  • Either click the “Run” button (top right of the source pane) or press Ctrl+Enter.
At this point, the code is sent to the console (the bottom left pane). You’ll first see your code appear in the console, and then you’ll see the output if there is any.

Console

This is where the code actually runs. When you first open RStudio, the console will tell you the version of R that you’re running (should be R 4.4.3 or greater).

While most often you’ll run code from a script in the source pane, you can also run code directly in the console. Code in the console won’t get saved to a file, but it’s a great way to experiment and test out lines of code before adding them to your script in the source pane. The console is also where errors appear if your code breaks. Deciphering errors can be a challenge that gets easier over time. Googling errors is a good place to start.

Environment/History/Connections
  • Environment: This is where you can see what is currently in your environment. Think of the environment as temporary storage for objects - things like datasets and stored values - that you are using in your script(s). You can also click on objects and view them. Anything you see in your environment is temporary and it will disappear when you restart R. If there is something in your environment that you want to access in the future, make sure your script is able to reproduce it (or save it to a file).
  • History: This shows the code you’ve run in the current session. It’s not good to rely on it, but it can be a way to recover code you ran in the console and later realized you needed in your script.
  • Connections: This is one way to connect R to a database.
  • Git: If you have installed Git on your computer, you may see a Git tab. We won’t talk much about it this week, but this is where you’ll keep track of changes to your code.
  • Tutorial: This has some interactive tutorials that you can check out if you are interested.

Files/Plots/Packages/Help/Viewer
  • Files: This tab shows the files within your working directory (typically the folder where your current code project lives). More on this later.
  • Plots: This tab will show plots that you create.
  • Packages: This tab allows you to install, load, and update packages, and also view the help files within each package. You can also access these files in code.
  • Help: Allows you to search for and view documentation for packages that are installed on your computer.
  • Viewer: Shows HTML outputs produced by R Markdown, R Shiny, and some plotting and mapping packages.


RStudio Global Options

There are several settings in the Global Options that everyone should check to make sure we all have consistent settings. Go to Tools -> Global Options and follow the steps below.
  1. Under the General tab, you should see that your R Version is [64-bit] and the version is R-4.4.3 or greater. If it’s not, you probably need to update R. Let me know if you need help with this.
  2. Also in the General tab, make sure you are not saving your environment. To do this, uncheck the Restore .RData into your workspace at startup option. When the restore option is checked and a workspace was saved on exit, RStudio reloads that saved environment (the stuff in the Environment tab) the next time you open it.
    • This may seem like a good thing, but a main point of using R is that your code should return the same results every time you run it. Clearing the environment every time you close RStudio forces you to run your code with a clean slate.
    • Set Save workspace to .RData on exit: to Never. The only reason not to set to “Never” is if you are working with a huge dataset that takes a long time to load and process. In that case, you may want to set Save workspace to .RData on exit to “Ask”. When you close RStudio, it will ask you if you want to save your workspace image.
  3. Change the default pipe to the base R pipe by going to the Code tab and checking the box Use native pipe operator, |> (requires R 4.1+). We will discuss what this pipe means tomorrow.
  4. Most other settings are whatever you prefer. Note that to change the color of your background and text, go to the Appearance tab. I prefer Cobalt.

Project and File Setup

File organization

File organization is an important part of being a good coder. Keeping code, input data, and results together in one place will protect your sanity and the sanity of the person who inherits the project. RStudio projects help with this. Creating a new RStudio project for each new code project makes it easier to manage settings and file paths.

Before we create a project, take a look at the Console tab. Notice that at the top of the console there is a folder path. That path is your current working directory.
Default working directory

If you refer to a file in R using a relative path, for example ./data/my_data_file.csv, R will look in your current working directory for a folder called data containing a file called my_data_file.csv.

Note the use of forward slashes instead of back slashes for file paths. In R, you can use either a forward slash (/) or a double back slash (\\) in file paths. The two paths below are equivalent, and show the kind of full file path that a relative path like the one above stands in for.

# forward slash file path approach
"C:/Users/KMMiller/OneDrive - DOI/data/"
## [1] "C:/Users/KMMiller/OneDrive - DOI/data/"
# backward slash file path approach
"C:\\Users\\KMMiller\\OneDrive - DOI\\data\\"
## [1] "C:\\Users\\KMMiller\\OneDrive - DOI\\data\\"
Using relative paths is helpful because a full path is specific to your computer and likely won’t work on a different computer. But there’s no guarantee that everyone has the same default R working directory. This is where projects come in. Projects package all of your code, data, output, etc. into a single folder that is easily transferable to other machines regardless of file location.
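To see how a relative path relates to your working directory, here is a small sketch using base R functions. The file name my_data_file.csv is just the example name from above, so file.exists() will likely return FALSE unless you happen to have a file with that name.

```r
# Show the current working directory
getwd()

# Build a relative path from its parts; file.path() joins them with forward slashes
rel_path <- file.path("data", "my_data_file.csv")
rel_path

# The full path that the relative path corresponds to on your computer
file.path(getwd(), rel_path)

# Check whether a file exists at that location before trying to read it
file.exists(rel_path)
```

Checking file.exists() first is optional, but it can make "file not found" errors easier to diagnose.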

Start a new Project

To demonstrate the value of a project, we’ll create and use one for this class. Click File > New Project. In the window that appears, select New Directory, then select New Project. You will be prompted for a directory name. This is the name of your project folder. For this class, call it imd_r_intro. Next, you’ll select what folder to keep your project folder in. Documents/R is a good place to store all of your R projects but it’s up to you. When you are done, click on Create Project.

Step 1. Select New Directory New project step 1

Step 2. Select New Project New project step 2

Step 3. Name project imd_r_intro Save project to a place you can find it. Don’t worry about whether the git repository box is checked or not.
New project step 3


If you successfully started a project named imd_r_intro, you should see it listed at the very top right of your screen. As you start new projects, you’ll want to check that you’re working in the right one before you start coding. Take a look at the Console tab again. Notice that your current working directory is now your project folder. When you look in the Files tab of the bottom right pane, you’ll see that it also defaults to the project folder.

We also want to create a folder called “data”, where we will store datasets we’re using for this class. To do that, you can either go to Windows Explorer and add a new folder, or run the code below. As long as you’re working within your project (project name should be at the top right of window), a folder named data will appear within your project. You can check that it worked by using the list.files() function, which lists everything in the working directory of your project.

Create data folder

dir.create("data")
list.files() # you should see a data folder listed 
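Note that running dir.create("data") a second time gives a warning that the folder already exists. If you want code that is safe to rerun, one option is to check for the folder first. This is just a sketch of that pattern using base R’s dir.exists().

```r
# Create the data folder only if it does not exist yet
if (!dir.exists("data")) {
  dir.create("data")
}

# Confirm the folder is there
dir.exists("data")
```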

Start coding

Start a new script

First let’s create a new R script file called day_1_script.R. Make sure you are working in the imd_r_intro project that you just created. Click on the New File icon in the top left corner. In the dropdown, select R Script. The source pane will appear with an untitled empty script. Go ahead and save it by clicking the Save icon (and make sure the Source on Save checkbox is deselected). Call your new script day_1_script.R.

Coding basics

We’ll start with something simple. Basic math in R is pretty straightforward and the syntax is similar to simply using a graphing calculator. You can use the examples below or come up with your own. Even if you’re using the examples, try to actually type the code instead of copy-pasting - you’ll learn to code faster that way.

To run a single line of code in your script, place your cursor anywhere in that line and press CTRL+ENTER (or click the Run button in the top right of the script pane). To run multiple lines of code, highlight the lines you want to run and hit CTRL+ENTER or click Run.

To leave notes in your script, use the hashtag/pound sign (#). R treats everything after the # on that line as a comment and doesn’t run it, and RStudio displays comments in a different color. Commenting your code is one of the best habits you can form. Comments are a gift to your future self and anyone else who tries to use your code.

Type code below in your script and run each line

# By using this hashtag/pound sign, you are telling R to ignore the text afterwards. This is useful for leaving notes or instructions for yourself, or someone else using your code

# try this line to generate some basic text and become familiar with where results will appear:
print("Hello, let's do some basic math. Results of operations will appear here")
## [1] "Hello, let's do some basic math. Results of operations will appear here"
# one plus one
1+1
## [1] 2
# two times three, divided by four
(2*3)/4
## [1] 1.5
# basic mathematical and trigonometric functions are fairly similar to what they would be in Excel

# calculate the square root of 9
sqrt(9)
## [1] 3
# calculate the cube root of 8 (remember that x^(1/n) gives you the nth root of x)
8^(1/3)
## [1] 2
# get the cosine of 180 degrees - note that trig functions in R expect angles in radians
# also note that pi is a built-in constant in R
cos(pi)
## [1] -1
# calculate 5 to the tenth power
5^10
## [1] 9765625
Notice that when you run a line of code, the code and the result appear in the console. You can also type code directly into the console, but it won’t be saved anywhere. As you get more comfortable with R, it can be helpful to use the console as a “scratchpad” for experimenting and troubleshooting. For now, it’s best to err on the side of saving your code as a script so that you don’t accidentally lose useful work.


Variables

Occasionally, it’s enough to just run a line of code and display the result in the console. But typically our code is more complex than adding one plus one, and we want to store the result and use it later in the script. This is where variables come in. Variables allow you to assign a value (whether that’s a number, a data table, a chunk of text, or any other type of data that R can handle) to a short, human-readable name. Anywhere you put a variable in your code, R will replace it with its value when your code runs. Variables are also called objects in R.

R uses the <- symbol for variable assignment. If you’ve used other programming languages, you may be tempted to use = instead. It will work, but there are subtle differences between <- and =, so you should get in the habit of using <-.
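One of those subtle differences shows up inside function calls. The sketch below uses made-up object names (n_plots, n_sites, my_values) purely for illustration.

```r
# At the top level of a script, both operators assign a value
n_plots <- 5
n_sites = 5
identical(n_plots, n_sites)

# Inside a function call, = only names an argument; no variable called x is created
median(x = c(1, 5, 9))

# Inside a function call, <- creates the variable AND passes it to the function
median(my_values <- c(1, 5, 9))
my_values
```

Sticking with <- for assignment and = for naming arguments keeps the two jobs visually distinct.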

R is case-sensitive. So if you name one object treedata and another Treedata or TREEDATA, R will interpret these all as unique objects. While you can do things like this, it’s best practice not to use the same name for different objects, as it makes code difficult to follow.
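Here’s a quick illustration of case sensitivity, using the example names above with made-up values.

```r
# Three different objects, because the capitalization differs
treedata <- 10
Treedata <- 20
TREEDATA <- 30

# Each name holds its own value
treedata + Treedata + TREEDATA

# A fourth spelling has not been created, so R can't find it
# (assuming you haven't made an object with this exact spelling)
exists("treeData")
```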

Type code below to assign values to variables named a and b

# the value of 12.098 is assigned to variable 'a'
a <- 12.098

# and the value 65.3475 is assigned to variable 'b'
b <- 65.3475

# we can now perform whatever mathematical operations we want using these two variables without having to repeatedly type out the actual numbers:

a*b
## [1] 790.5741
(a^b)/((b+a))
## [1] 7.305156e+68
sqrt((a^7)/(b*2))
## [1] 538.7261

In the code above, we assign the variables a and b once. We can then reuse them as often as we want. This is helpful because we save ourselves some typing, reduce the chances of making a typo somewhere, and if we need to change the value of a or b, we only have to do it in one place.

Also notice that when you assign variables, you can see them listed in your Environment tab (top right pane). Remember, everything you see in the environment is just in R’s temporary memory and won’t be saved when you close out of RStudio.
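You can also inspect and clean up your environment with code instead of the Environment tab. This is a minimal sketch using the base R functions ls() and rm(), with the same a and b values as above.

```r
a <- 12.098
b <- 65.3475

# List the names of the objects currently in your environment
ls()

# Remove a single object
rm(a)

# a is gone, b is still there
"a" %in% ls()
"b" %in% ls()

# To clear everything at once, you could run: rm(list = ls())
```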

All of the examples you’ve seen so far are fairly contrived for the sake of simplicity. Let’s take a look at some code that everyone here will make use of at some point: reading data from a CSV.


Functions

It’s hard to get very far in R without making use of functions. Think of a function as a programmed task that takes some kind of input (the argument(s)) from the user and outputs a result (the return value).

anatomy of a function


Note how RStudio color codes what it thinks are functions. Base R, which is what comes with R when you install it, includes many pre-programmed functions. Installing R packages adds more functions, and you can also build your own. Names that RStudio recognizes as functions are color coded differently than text, numbers, etc. It’s also good practice not to reuse the names of existing functions as new object names.
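To make the idea of arguments and return values more concrete, here is a short sketch using two base R functions, round() and mean(). The object name cover_vals and its values are made up for this example.

```r
# Arguments can be matched by position...
round(3.14159, 2)

# ...or by name, which is easier to read
round(x = 3.14159, digits = 2)

# Many functions have optional arguments with default values.
# By default, mean() returns NA if any value is missing;
# setting na.rm = TRUE drops missing values first
cover_vals <- c(1, NA, 3)
mean(cover_vals)
mean(cover_vals, na.rm = TRUE)

# To see a function's arguments, open its help page from the console: ?round
```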

Commonly used base R functions include:
  • mean(): calculate the mean of a set of numbers
  • min(): calculate the minimum of a set of numbers
  • max(): calculate the maximum of a set of numbers
  • range(): calculate the min and max of a set of numbers
  • sd(): calculate the standard deviation of a set of numbers
  • sqrt(): calculate the square root of a value

Calculate mean and range to see how functions work

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# equivalent to x <- 1:10

# bad coding: reuses the name of the existing mean() function for an object
#mean <- mean(x)

# good coding 
mean_x <- mean(x)
mean_x
## [1] 5.5
range_x <- range(x)
range_x
## [1]  1 10
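The other functions in the list above work the same way. The sketch below applies a few of them to the same x vector, and shows that the result of one function can be used as the input to another.

```r
x <- 1:10

min(x)   # smallest value
max(x)   # largest value
sd(x)    # standard deviation
sqrt(x)  # square root of every element in x

# Functions can be nested: the inner results are calculated first
sqrt(max(x) - min(x))
```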


Importing and saving data

Most of the work we do in R relies on one or more existing datasets that we want to query or summarize, rather than creating our own in R. Importing data in R is therefore an important skill. R can import just about any data type, including CSV and MS Excel files. You can also import tables from MS Access and SQL databases using ODBC drivers. That’s beyond the scope of this class, but I can share examples for anyone needing to import from a database. For now, I’ll show how to work with CSVs and Excel spreadsheets.

Import CSV

We use the read.csv() function to import CSVs in R. The read.csv() function takes the file path or URL to the CSV as input and outputs a data frame containing the data from the CSV. Here we’re going to read a CSV from a website, then save that in the data folder of our project.

Run the following line to import a teaching ACAD wetland dataset from the GitHub repository for this training

# read in the data from ACAD_wetland_data_clean.csv and assign it as a dataframe to the variable "ACAD_wetland"
ACAD_wetland <- read.csv(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/ACAD_wetland_data_clean.csv"
  )

View the data in a separate window by running the View() function.

# View the ACAD_wetland data frame we just created
View(ACAD_wetland)

Or, check out the first few or last few records in your console.

# Look at the top 6 rows of the data frame
head(ACAD_wetland)
##   Site_Name Site_Type          Latin_Name           Common Year PctFreq Ave_Cov
## 1    SEN-01  Sentinel         Acer rubrum        red maple 2011       0    0.02
## 2    SEN-01  Sentinel         Amelanchier     serviceberry 2011      20    0.02
## 3    SEN-01  Sentinel Andromeda polifolia     bog rosemary 2011      80    2.22
## 4    SEN-01  Sentinel    Arethusa bulbosa   dragon's mouth 2011      40    0.04
## 5    SEN-01  Sentinel  Aronia melanocarpa black chokeberry 2011     100    2.64
## 6    SEN-01  Sentinel        Carex exilis    coastal sedge 2011      60    6.60
##   Invasive Protected  X_Coord Y_Coord
## 1    FALSE     FALSE 574855.5 4911909
## 2    FALSE     FALSE 574855.5 4911909
## 3    FALSE     FALSE 574855.5 4911909
## 4    FALSE      TRUE 574855.5 4911909
## 5    FALSE     FALSE 574855.5 4911909
## 6    FALSE     FALSE 574855.5 4911909
# Look at the bottom 6 rows of the data frame
tail(ACAD_wetland)
##     Site_Name Site_Type                      Latin_Name
## 503    RAM-05       RAM             Vaccinium oxycoccos
## 504    RAM-05       RAM           Vaccinium vitis-idaea
## 505    RAM-05       RAM Viburnum nudum var. cassinoides
## 506    RAM-05       RAM Viburnum nudum var. cassinoides
## 507    RAM-05       RAM                   Xyris montana
## 508    RAM-05       RAM                   Xyris montana
##                         Common Year PctFreq Ave_Cov Invasive Protected X_Coord
## 503            small cranberry 2012     100    0.04    FALSE     FALSE  553186
## 504                lingonberry 2017      25    0.02    FALSE     FALSE  553186
## 505       northern wild raisin 2017     100    0.84    FALSE     FALSE  553186
## 506       northern wild raisin 2012     100   63.00    FALSE     FALSE  553186
## 507 northern yellow-eyed-grass 2017      50    0.44    FALSE     FALSE  553186
## 508 northern yellow-eyed-grass 2012      50    1.24    FALSE     FALSE  553186
##     Y_Coord
## 503 4899764
## 504 4899764
## 505 4899764
## 506 4899764
## 507 4899764
## 508 4899764

Save CSV

Now we’ll write the CSV to disk, then import it from your computer.

# Write the data frame to your data folder using a relative path. 
# By default, write.csv adds a column with row names that are numbers. I don't
# like that, so I turn that off.
write.csv(ACAD_wetland, "./data/ACAD_wetland_data_clean.csv", row.names = FALSE)

Make sure the writing to disk worked by importing the CSV from your computer

# Read the data frame in using a relative path
ACAD_wetland <- read.csv("./data/ACAD_wetland_data_clean.csv")

# Equivalent code to read in the data frame using full path on my computer, but won't match another user.
ACAD_wetland <- read.csv("C:/Users/KMMiller/OneDrive - DOI/NETN/R_Dev/IMD_R_Training_2026/data/ACAD_wetland_data_clean.csv")
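If you’re curious what the row.names argument mentioned above actually changes, this sketch writes a tiny made-up data frame (not the ACAD data) both ways and reads each file back in. It uses tempfile() so nothing gets added to your project folder.

```r
# Small made-up data frame for illustration
demo_df <- data.frame(site = c("A", "B"), cover = c(0.5, 2.1))

# tempfile() creates throwaway file paths outside of your project
with_rows <- tempfile(fileext = ".csv")
no_rows   <- tempfile(fileext = ".csv")

write.csv(demo_df, with_rows)                   # default: row names are written
write.csv(demo_df, no_rows, row.names = FALSE)  # row names turned off

# The default version comes back with an extra first column named "X"
names(read.csv(with_rows))

# The row.names = FALSE version has only the original columns
names(read.csv(no_rows))
```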

We’ll get very familiar with data frames in this class, but for the moment just know that a data frame is a rectangular table of data with rows and columns. Data frames are typically organized with rows being records or observations (e.g. sampling locations, individual critters, etc.), and columns being variables that characterize those observations (e.g., species, size, date collected, X/Y coordinates). Once you have read the data in, you can take a quick look at its structure by typing the name of the variable it’s stored in.
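Besides typing the name of the data frame, base R has several functions for checking its size and structure. The sketch below uses a tiny made-up data frame (demo_wetland, with a few values borrowed from the output above) so it runs on its own; the same functions should work on ACAD_wetland.

```r
# Tiny made-up data frame for illustration
demo_wetland <- data.frame(
  Site_Name = c("SEN-01", "SEN-01", "RAM-05"),
  Latin_Name = c("Acer rubrum", "Carex exilis", "Xyris montana"),
  Year = c(2011, 2011, 2017),
  Invasive = c(FALSE, FALSE, FALSE)
)

nrow(demo_wetland)   # number of rows (records)
ncol(demo_wetland)   # number of columns (variables)
names(demo_wetland)  # column names
str(demo_wetland)    # compact summary: each column's data type and first values

# Use $ to pull out one column, then summarize it
min(demo_wetland$Year)
max(demo_wetland$Year)
```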

Import from XLSX

Base R does not have a way to import MS Excel files. The first step for working with Excel files (i.e., files with .xls or .xlsx extensions), therefore, is to install the readxl package to import .xlsx files and writexl to write files to .xlsx. The readxl package has a couple of options for loading Excel spreadsheets, depending on whether the extension is .xls, .xlsx, or unknown, along with options to import different worksheets within a spreadsheet.

The code below installs the required packages, loads them, then writes the ACAD_wetland data frame we just imported to an .xlsx file. The last step imports the .xlsx version of the ACAD wetland data.

  1. Install packages
    install.packages("readxl") # only need to run once. 
    install.packages("writexl")
  2. Load packages
    library(writexl) # saving xlsx
    library(readxl) # importing xlsx
  3. Write CSV to .xlsx to data folder. I’m going in this order to keep this training stand-alone. The read_xlsx() function can’t read from a URL like read.csv() can.
    write_xlsx(ACAD_wetland, "./data/ACAD_wetland_data_clean.xlsx")
  4. Import spreadsheet. Note that the default settings import the first sheet, so I didn’t really need to specify the sheet below. I included the sheet argument to show how it’s done.
    ACAD_wetxls <- read_xlsx(path = "./data/ACAD_wetland_data_clean.xlsx", sheet = "Sheet1") 
  5. View top 6 rows to check the data
    head(ACAD_wetxls)
    ## # A tibble: 6 × 11
    ##   Site_Name Site_Type Latin_Name Common  Year PctFreq Ave_Cov Invasive Protected
    ##   <chr>     <chr>     <chr>      <chr>  <dbl>   <dbl>   <dbl> <lgl>    <lgl>    
    ## 1 SEN-01    Sentinel  Acer rubr… red m…  2011       0    0.02 FALSE    FALSE    
    ## 2 SEN-01    Sentinel  Amelanchi… servi…  2011      20    0.02 FALSE    FALSE    
    ## 3 SEN-01    Sentinel  Andromeda… bog r…  2011      80    2.22 FALSE    FALSE    
    ## 4 SEN-01    Sentinel  Arethusa … drago…  2011      40    0.04 FALSE    TRUE     
    ## 5 SEN-01    Sentinel  Aronia me… black…  2011     100    2.64 FALSE    FALSE    
    ## 6 SEN-01    Sentinel  Carex exi… coast…  2011      60    6.6  FALSE    FALSE    
    ## # ℹ 2 more variables: X_Coord <dbl>, Y_Coord <dbl>
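For workbooks with more than one worksheet, readxl’s excel_sheets() function lists the sheet names so you know what to pass to the sheet argument. The sketch below is self-contained and uses made-up data, sheet names, and a temporary file. It assumes readxl and writexl are installed (the if statement skips the code if they aren’t), and uses the package::function() notation so it runs without library() calls.

```r
if (requireNamespace("readxl", quietly = TRUE) &&
    requireNamespace("writexl", quietly = TRUE)) {

  # Temporary file so nothing is added to your project folder
  demo_xlsx <- tempfile(fileext = ".xlsx")

  # Passing a named list to write_xlsx() writes one worksheet per list element
  writexl::write_xlsx(
    list(sites = data.frame(site = c("A", "B")),
         species = data.frame(latin_name = c("x", "y", "z"))),
    demo_xlsx
  )

  # List the worksheet names in the workbook
  sheet_names <- readxl::excel_sheets(demo_xlsx)
  print(sheet_names)

  # Import a specific worksheet by name
  species_tbl <- readxl::read_xlsx(demo_xlsx, sheet = "species")
  print(species_tbl)
}
```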


Data structures

We’re going to take a little detour into data structures at this point. It’ll all tie back in to our wetland data.

The data frame we just examined is a type of data structure. A data structure is what it sounds like: it’s a structure that holds data in an organized way. There are multiple data structures in R, including vectors, lists, arrays, matrices, data frames, and tibbles (more on this unfortunately-named data structure later). Today we’ll focus on vectors and data frames.

Vectors

Vectors are the simplest data structure in R. You can think of vectors as one column of data in an Excel spreadsheet, and the elements are each row in the column. Here are some examples of vectors:

digits <- 0:9  # Use x:y to create a sequence of integers starting at x and ending at y
digits
##  [1] 0 1 2 3 4 5 6 7 8 9
is_odd <- rep(c(FALSE, TRUE), 5)  # Use rep(x, n) to create a vector by repeating x n times 
is_odd
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
shoe_sizes <- c(7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5)
shoe_sizes
##  [1]  7.0  7.5  8.0  8.5  9.0  9.5 10.0 10.5 11.0 11.5
favorite_birds <- c("black-capped chickadee", "dark-eyed junco", "golden-crowned kinglet")
favorite_birds
## [1] "black-capped chickadee" "dark-eyed junco"        "golden-crowned kinglet"

Note the use of c(). The c() function stands for combine, and it combines elements into a single vector. The c() function is a fairly universal way to combine multiple elements in R, and you’re going to see it over and over.
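
As a small sketch of how flexible c() is, it can also join existing vectors together or tack new elements onto one. The object names more_birds and all_birds, and the extra bird names, are made up for this illustration.

```r
favorite_birds <- c("black-capped chickadee", "dark-eyed junco", "golden-crowned kinglet")

# Hypothetical second vector, just for illustration
more_birds <- c("hermit thrush", "winter wren")

# c() combines two vectors into one longer vector
all_birds <- c(favorite_birds, more_birds)
length(all_birds)

# c() can also add a single element to the end of an existing vector
all_birds <- c(all_birds, "brown creeper")
all_birds
```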

Let’s play around with vectors a little more. We can use is.vector() to test whether something is a vector. We can get the length of a vector with length(). Note that single values in R are just vectors of length one.

is.vector(digits)  # Should be TRUE
## [1] TRUE
is.vector(favorite_birds)  # Should also be TRUE
## [1] TRUE
length(digits)  # Hopefully this is 10
## [1] 10
length(shoe_sizes)
## [1] 10
# Even single values in R are stored as vectors
length_one_chr <- "length one vector"
length_one_int <- 4
is.vector(length_one_chr)
## [1] TRUE
is.vector(length_one_int)
## [1] TRUE
length(length_one_chr)
## [1] 1
length(length_one_int)
## [1] 1
In the examples above, each vector contains a different type of data. digits contains integers, is_odd contains logical (true/false) values, favorite_birds contains text, and shoe_sizes contains decimal numbers. That’s because a given vector can only contain a single type of data.


Data Types

In R, there are four data types that we will typically encounter:

  • character Regular text, denoted with double or single quotation marks (e.g. "hello", "3", "R is my favorite programming language")
  • numeric Decimal numbers (e.g. 23, 3.1415)
  • integer Integers. If you want to explicitly denote a number as an integer in R, append L to it or use as.integer() (e.g. 5L, as.integer(30)).
  • logical True or false values (TRUE, FALSE). Note that TRUE and FALSE must be all-uppercase.

There are two more data types, complex and raw, but you are unlikely to encounter these so we won’t cover them here.

You can use the class() function to get the data type of a vector:

class(favorite_birds)
## [1] "character"
class(shoe_sizes)
## [1] "numeric"
class(digits)
## [1] "integer"
class(is_odd)
## [1] "logical"
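
To see the one-type-per-vector rule in action, here is a minimal sketch of what happens when you mix types inside c(). R quietly converts everything to the most flexible type present. The object names (mixed, num_and_int, lgl_and_num) are made up for this illustration.

```r
# A bare number is numeric; the L suffix makes it an integer
class(5)    # "numeric"
class(5L)   # "integer"

# Mixing types in one vector forces R to convert to a single type
mixed <- c(1, "two", TRUE)
mixed         # "1" "two" "TRUE"
class(mixed)  # "character"

num_and_int <- c(1L, 2.5)
class(num_and_int)  # "numeric"

lgl_and_num <- c(TRUE, FALSE, 3)
lgl_and_num   # 1 0 3 (TRUE becomes 1, FALSE becomes 0)
```

This is also why a single stray text value in a column of numbers makes the whole column read in as character.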

If you need to access a single element of a vector, you can use the syntax my_vector[x] where x is the element’s index (the number corresponding to its position in the vector). You can also use a vector of indices to extract multiple elements from the vector. Note that in R, indexing starts at 1 (i.e. my_vector[1] is the first element of my_vector). If you’ve coded in other languages, you may be used to indexing starting at 0.

second_favorite_bird <- favorite_birds[2]
second_favorite_bird
## [1] "dark-eyed junco"
top_two_birds <- favorite_birds[c(1,2)]
top_two_birds
## [1] "black-capped chickadee" "dark-eyed junco"

Logical vectors can also be used to subset a vector. The logical vector should be the same length as the vector you are subsetting. Elements that line up with TRUE are kept, and elements that line up with FALSE are dropped.

odd_digits <- digits[is_odd]
odd_digits
## [1] 1 3 5 7 9
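
In practice, you rarely type out a logical vector by hand. A more common pattern, sketched below, is to build it with a comparison. The names is_big and big_digits are made up for this illustration.

```r
digits <- 0:9

# A comparison returns one TRUE/FALSE per element
is_big <- digits > 5
is_big

# Use the logical vector to keep only the elements that are TRUE
big_digits <- digits[is_big]
big_digits  # 6 7 8 9

# Same thing in one step
digits[digits > 5]

# sum() on a logical vector counts the TRUE values
sum(is_big)  # 4
```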


Data frames

Let’s revisit our wetland data frame. We’ve explored the data frame as a whole, but it’s often useful to look at one column at a time. To do this, we’ll use the $ syntax:

See list of all sites and species in the wetland data (output truncated at 10 records)

ACAD_wetland$Site_Name
ACAD_wetland$Latin_Name
##  [1] "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01"
##  [9] "SEN-01" "SEN-01"
##  [1] "Acer rubrum"             "Amelanchier"            
##  [3] "Andromeda polifolia"     "Arethusa bulbosa"       
##  [5] "Aronia melanocarpa"      "Carex exilis"           
##  [7] "Chamaedaphne calyculata" "Drosera intermedia"     
##  [9] "Drosera rotundifolia"    "Empetrum nigrum"

You can also use square brackets [] to access data frame columns. Square brackets are base R’s way to view different subsets of your data. I’m only going to touch briefly on this, so you have a basic understanding of how to interpret square brackets. Tomorrow I’ll show you much easier ways to subset your data using tidyverse functions.

Every data frame has 2 dimensions. The first dimension is rows and the second is columns. The code below asks for the dimensions of the ACAD_wetland data frame, and returns 508 11. That means there are 508 rows and 11 columns. The square brackets allow you to subset rows, columns, or both at the same time, with rows specified first and columns second.

Return data frame number of rows and columns by checking data frame dimensions

dim(ACAD_wetland)
## [1] 508  11
nrow(ACAD_wetland) # first dim
## [1] 508
ncol(ACAD_wetland) # second dim
## [1] 11

Return first 5 rows of the data frame

ACAD_wetland[1:5,]
##   Site_Name Site_Type          Latin_Name           Common Year PctFreq Ave_Cov
## 1    SEN-01  Sentinel         Acer rubrum        red maple 2011       0    0.02
## 2    SEN-01  Sentinel         Amelanchier     serviceberry 2011      20    0.02
## 3    SEN-01  Sentinel Andromeda polifolia     bog rosemary 2011      80    2.22
## 4    SEN-01  Sentinel    Arethusa bulbosa   dragon's mouth 2011      40    0.04
## 5    SEN-01  Sentinel  Aronia melanocarpa black chokeberry 2011     100    2.64
##   Invasive Protected  X_Coord Y_Coord
## 1    FALSE     FALSE 574855.5 4911909
## 2    FALSE     FALSE 574855.5 4911909
## 3    FALSE     FALSE 574855.5 4911909
## 4    FALSE      TRUE 574855.5 4911909
## 5    FALSE     FALSE 574855.5 4911909
ACAD_wetland[c(1, 2, 3, 4, 5),] #equivalent but more typing
##   Site_Name Site_Type          Latin_Name           Common Year PctFreq Ave_Cov
## 1    SEN-01  Sentinel         Acer rubrum        red maple 2011       0    0.02
## 2    SEN-01  Sentinel         Amelanchier     serviceberry 2011      20    0.02
## 3    SEN-01  Sentinel Andromeda polifolia     bog rosemary 2011      80    2.22
## 4    SEN-01  Sentinel    Arethusa bulbosa   dragon's mouth 2011      40    0.04
## 5    SEN-01  Sentinel  Aronia melanocarpa black chokeberry 2011     100    2.64
##   Invasive Protected  X_Coord Y_Coord
## 1    FALSE     FALSE 574855.5 4911909
## 2    FALSE     FALSE 574855.5 4911909
## 3    FALSE     FALSE 574855.5 4911909
## 4    FALSE      TRUE 574855.5 4911909
## 5    FALSE     FALSE 574855.5 4911909

Return first 5 rows and a subset of columns of the data frame

ACAD_wetland[1:5, c("Site_Name", "Latin_Name", "Common", "Year", "PctFreq")]
##   Site_Name          Latin_Name           Common Year PctFreq
## 1    SEN-01         Acer rubrum        red maple 2011       0
## 2    SEN-01         Amelanchier     serviceberry 2011      20
## 3    SEN-01 Andromeda polifolia     bog rosemary 2011      80
## 4    SEN-01    Arethusa bulbosa   dragon's mouth 2011      40
## 5    SEN-01  Aronia melanocarpa black chokeberry 2011     100

CHALLENGE: How would you look at the first 4 even rows (2, 4, 6, 8)?

Answer
ACAD_wetland[c(2, 4, 6, 8),]
##   Site_Name Site_Type         Latin_Name           Common Year PctFreq Ave_Cov
## 2    SEN-01  Sentinel        Amelanchier     serviceberry 2011      20    0.02
## 4    SEN-01  Sentinel   Arethusa bulbosa   dragon's mouth 2011      40    0.04
## 6    SEN-01  Sentinel       Carex exilis    coastal sedge 2011      60    6.60
## 8    SEN-01  Sentinel Drosera intermedia spoonleaf sundew 2011      60    0.06
##   Invasive Protected  X_Coord Y_Coord
## 2    FALSE     FALSE 574855.5 4911909
## 4    FALSE      TRUE 574855.5 4911909
## 6    FALSE     FALSE 574855.5 4911909
## 8    FALSE     FALSE 574855.5 4911909
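
Typing out row numbers works for a handful of rows. For longer patterns, one alternative is to generate the row numbers with seq(). The sketch below uses a small made-up data frame (toy) in place of ACAD_wetland so it runs on its own.

```r
# Hypothetical stand-in for a real data frame
toy <- data.frame(id = 1:10, value = letters[1:10])

# seq(from, to, by) builds a regular sequence of numbers
even_rows <- seq(2, 8, by = 2)
even_rows  # 2 4 6 8

# Use the sequence in the row position of the brackets
toy[even_rows, ]
```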


You can specify columns by name or by index (integer indicating position of column). It’s almost always best to refer to columns by name when possible because it makes your code easier to read and prevents your code from breaking if columns get reordered. But, in case you come across code with numbers in the column part of the brackets, here’s what it looks like. Note the empty space to the left of the comma. That means you want all rows, but only the first 4 columns.

Return all rows and first 4 columns of the data frame

ACAD_sub <- ACAD_wetland[ , 1:4] # works, but risky
ACAD_sub2 <- 
  ACAD_wetland[,c("Site_Name", "Site_Type", "Latin_Name", "Common")] #same result, but better

# compare the two data frames to the original
head(ACAD_wetland)
##   Site_Name Site_Type          Latin_Name           Common Year PctFreq Ave_Cov
## 1    SEN-01  Sentinel         Acer rubrum        red maple 2011       0    0.02
## 2    SEN-01  Sentinel         Amelanchier     serviceberry 2011      20    0.02
## 3    SEN-01  Sentinel Andromeda polifolia     bog rosemary 2011      80    2.22
## 4    SEN-01  Sentinel    Arethusa bulbosa   dragon's mouth 2011      40    0.04
## 5    SEN-01  Sentinel  Aronia melanocarpa black chokeberry 2011     100    2.64
## 6    SEN-01  Sentinel        Carex exilis    coastal sedge 2011      60    6.60
##   Invasive Protected  X_Coord Y_Coord
## 1    FALSE     FALSE 574855.5 4911909
## 2    FALSE     FALSE 574855.5 4911909
## 3    FALSE     FALSE 574855.5 4911909
## 4    FALSE      TRUE 574855.5 4911909
## 5    FALSE     FALSE 574855.5 4911909
## 6    FALSE     FALSE 574855.5 4911909
head(ACAD_sub)
##   Site_Name Site_Type          Latin_Name           Common
## 1    SEN-01  Sentinel         Acer rubrum        red maple
## 2    SEN-01  Sentinel         Amelanchier     serviceberry
## 3    SEN-01  Sentinel Andromeda polifolia     bog rosemary
## 4    SEN-01  Sentinel    Arethusa bulbosa   dragon's mouth
## 5    SEN-01  Sentinel  Aronia melanocarpa black chokeberry
## 6    SEN-01  Sentinel        Carex exilis    coastal sedge
head(ACAD_sub2)
##   Site_Name Site_Type          Latin_Name           Common
## 1    SEN-01  Sentinel         Acer rubrum        red maple
## 2    SEN-01  Sentinel         Amelanchier     serviceberry
## 3    SEN-01  Sentinel Andromeda polifolia     bog rosemary
## 4    SEN-01  Sentinel    Arethusa bulbosa   dragon's mouth
## 5    SEN-01  Sentinel  Aronia melanocarpa black chokeberry
## 6    SEN-01  Sentinel        Carex exilis    coastal sedge
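
Here is a minimal sketch of why column numbers are risky. It uses a small made-up data frame (toy) and reorders its columns, which is an assumption about the kind of change that might happen to a real dataset over time.

```r
# Hypothetical data frame for illustration
toy <- data.frame(Site_Name = c("SEN-01", "SEN-02"),
                  Year = c(2011, 2012),
                  PctFreq = c(20, 80))

# Same data, but the columns arrive in a different order
reordered <- toy[, c("Year", "PctFreq", "Site_Name")]

# Subsetting by name returns the same column either way
identical(toy[, "Site_Name"], reordered[, "Site_Name"])  # TRUE

# Subsetting by position silently returns a different column
toy[, 1]        # site names
reordered[, 1]  # years
identical(toy[, 1], reordered[, 1])  # FALSE
```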


Advanced bracketry

You can do more than just subset by row numbers and column names. A couple of more advanced uses of brackets are below. Again, this is for exposure, in case you come across them while reading something like a StackOverflow post. There are easier ways to subset your data in R, which we will cover on Day 2. Another important point about R is that there are often multiple ways to perform a task. The best code is code that works, is easy to follow, and is unlikely to break (e.g. uses column names instead of numbers). That still means there are typically multiple equally valid approaches. There are other ways to judge good code as you advance, but for now, meeting the 3 criteria above is a good goal.

Filter data so only invasive species = T are returned

ACAD_wetland$Latin_Name[ACAD_wetland$Invasive == TRUE]
## [1] "Berberis thunbergii"   "Berberis thunbergii"   "Berberis thunbergii"  
## [4] "Celastrus orbiculatus" "Rhamnus frangula"      "Rhamnus frangula"     
## [7] "Rhamnus frangula"      "Rhamnus frangula"      "Lonicera - Exotic"
ACAD_wetland[ACAD_wetland$Invasive == TRUE, "Latin_Name"] # equivalent
## [1] "Berberis thunbergii"   "Berberis thunbergii"   "Berberis thunbergii"  
## [4] "Celastrus orbiculatus" "Rhamnus frangula"      "Rhamnus frangula"     
## [7] "Rhamnus frangula"      "Rhamnus frangula"      "Lonicera - Exotic"
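
Because Invasive is already a logical column, the == TRUE part is optional. The sketch below shows this on a small made-up data frame (toy), and uses ! to flip the logic. The three rows are illustrative, not the full wetland data.

```r
# Hypothetical three-row stand-in for the wetland data
toy <- data.frame(Latin_Name = c("Acer rubrum", "Berberis thunbergii", "Rhamnus frangula"),
                  Invasive = c(FALSE, TRUE, TRUE))

# These two lines return the same species
with_compare <- toy$Latin_Name[toy$Invasive == TRUE]
without_compare <- toy$Latin_Name[toy$Invasive]
identical(with_compare, without_compare)  # TRUE

# ! means NOT, so this returns the non-invasive species
not_invasive <- toy$Latin_Name[!toy$Invasive]
not_invasive  # "Acer rubrum"
```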

Return only unique species sorted alphabetically.

sort(unique(ACAD_wetland[, "Latin_Name"]))


##   [1] "Acer rubrum"                     "Alnus incana"                   
##   [3] "Alnus incana++"                  "Amelanchier"                    
##   [5] "Andromeda polifolia"             "Apocynum androsaemifolium"      
##   [7] "Arethusa bulbosa"                "Aronia melanocarpa"             
##   [9] "Berberis thunbergii"             "Betula populifolia"             
##  [11] "Calamagrostis canadensis"        "Calopogon tuberosus"            
##  [13] "Carex"                           "Carex atlantica"                
##  [15] "Carex exilis"                    "Carex folliculata"              
##  [17] "Carex lacustris"                 "Carex lasiocarpa"               
##  [19] "Carex limosa"                    "Carex magellanica"              
##  [21] "Carex Ovalis group"              "Carex pauciflora"               
##  [23] "Carex stricta"                   "Carex trisperma"                
##  [25] "Carex utriculata"                "Celastrus orbiculatus"          
##  [27] "Chamaedaphne calyculata"         "Comptonia peregrina"            
##  [29] "Cornus canadensis"               "Danthonia spicata"              
##  [31] "Dichanthelium acuminatum"        "Doellingeria umbellata"         
##  [33] "Drosera intermedia"              "Drosera rotundifolia"           
##  [35] "Dryopteris cristata"             "Dulichium arundinaceum"         
##  [37] "Empetrum nigrum"                 "Epilobium leptophyllum"         
##  [39] "Equisetum arvense"               "Eriophorum angustifolium"       
##  [41] "Eriophorum tenellum"             "Eriophorum vaginatum"           
##  [43] "Eriophorum virginicum"           "Eurybia macrophylla"            
##  [45] "Eurybia radula"                  "Festuca filiformis"             
##  [47] "Gaultheria hispidula"            "Gaylussacia baccata"            
##  [49] "Gaylussacia dumosa"              "Glyceria"                       
##  [51] "Glyceria striata"                "Ilex mucronata"                 
##  [53] "Ilex verticillata"               "Iris versicolor"                
##  [55] "Juncus acuminatus"               "Juncus canadensis"              
##  [57] "Juncus effusus"                  "Juniperus communis"             
##  [59] "Kalmia angustifolia"             "Kalmia polifolia"               
##  [61] "Larix laricina"                  "Lonicera - Exotic"              
##  [63] "Lupinus polyphyllus"             "Lysimachia terrestris"          
##  [65] "Maianthemum canadense"           "Maianthemum trifolium"          
##  [67] "Malus"                           "Melampyrum lineare"             
##  [69] "Monotropa uniflora"              "Morella pensylvanica"           
##  [71] "Muhlenbergia uniflora"           "Myrica gale"                    
##  [73] "Nuphar variegata"                "Oclemena nemoralis"             
##  [75] "Oclemena X blakei"               "Onoclea sensibilis"             
##  [77] "Osmunda regalis"                 "Osmundastrum cinnamomea"        
##  [79] "Phleum pratense"                 "Picea glauca"                   
##  [81] "Picea mariana"                   "Picea rubens"                   
##  [83] "Pinus banksiana"                 "Pinus strobus"                  
##  [85] "Pogonia ophioglossoides"         "Populus grandidentata"          
##  [87] "Populus tremuloides"             "Prenanthes"                     
##  [89] "Quercus rubra"                   "Ranunculus acris"               
##  [91] "Rhamnus frangula"                "Rhododendron canadense"         
##  [93] "Rhododendron groenlandicum"      "Rhynchospora alba"              
##  [95] "Rosa nitida"                     "Rosa palustris"                 
##  [97] "Rosa virginiana"                 "Rubus"                          
##  [99] "Rubus flagellaris"               "Rubus hispidus"                 
## [101] "Salix"                           "Salix petiolaris"               
## [103] "Sarracenia purpurea"             "Scirpus cyperinus"              
## [105] "Scutellaria"                     "Scutellaria lateriflora"        
## [107] "Solidago rugosa"                 "Solidago uliginosa"             
## [109] "Sorbus americana"                "Spiraea alba"                   
## [111] "Spiraea tomentosa"               "Symphyotrichum novi-belgii"     
## [113] "Symplocarpus foetidus"           "Thelypteris palustris"          
## [115] "Thuja occidentalis"              "Triadenum"                      
## [117] "Triadenum virginicum"            "Trichophorum cespitosum"        
## [119] "Trientalis borealis"             "Typha latifolia"                
## [121] "Utricularia cornuta"             "Vaccinium angustifolium"        
## [123] "Vaccinium corymbosum"            "Vaccinium macrocarpon"          
## [125] "Vaccinium myrtilloides"          "Vaccinium oxycoccos"            
## [127] "Vaccinium vitis-idaea"           "Veronica officinalis"           
## [129] "Viburnum nudum"                  "Viburnum nudum var. cassinoides"
## [131] "Vicia cracca"                    "Viola"                          
## [133] "Xyris montana"


sort(unique(ACAD_wetland$Latin_Name)) # equivalent
##   [1] "Acer rubrum"                     "Alnus incana"                   
##   [3] "Alnus incana++"                  "Amelanchier"                    
##   [5] "Andromeda polifolia"             "Apocynum androsaemifolium"      
##   [7] "Arethusa bulbosa"                "Aronia melanocarpa"             
##   [9] "Berberis thunbergii"             "Betula populifolia"             
##  [11] "Calamagrostis canadensis"        "Calopogon tuberosus"            
##  [13] "Carex"                           "Carex atlantica"                
##  [15] "Carex exilis"                    "Carex folliculata"              
##  [17] "Carex lacustris"                 "Carex lasiocarpa"               
##  [19] "Carex limosa"                    "Carex magellanica"              
##  [21] "Carex Ovalis group"              "Carex pauciflora"               
##  [23] "Carex stricta"                   "Carex trisperma"                
##  [25] "Carex utriculata"                "Celastrus orbiculatus"          
##  [27] "Chamaedaphne calyculata"         "Comptonia peregrina"            
##  [29] "Cornus canadensis"               "Danthonia spicata"              
##  [31] "Dichanthelium acuminatum"        "Doellingeria umbellata"         
##  [33] "Drosera intermedia"              "Drosera rotundifolia"           
##  [35] "Dryopteris cristata"             "Dulichium arundinaceum"         
##  [37] "Empetrum nigrum"                 "Epilobium leptophyllum"         
##  [39] "Equisetum arvense"               "Eriophorum angustifolium"       
##  [41] "Eriophorum tenellum"             "Eriophorum vaginatum"           
##  [43] "Eriophorum virginicum"           "Eurybia macrophylla"            
##  [45] "Eurybia radula"                  "Festuca filiformis"             
##  [47] "Gaultheria hispidula"            "Gaylussacia baccata"            
##  [49] "Gaylussacia dumosa"              "Glyceria"                       
##  [51] "Glyceria striata"                "Ilex mucronata"                 
##  [53] "Ilex verticillata"               "Iris versicolor"                
##  [55] "Juncus acuminatus"               "Juncus canadensis"              
##  [57] "Juncus effusus"                  "Juniperus communis"             
##  [59] "Kalmia angustifolia"             "Kalmia polifolia"               
##  [61] "Larix laricina"                  "Lonicera - Exotic"              
##  [63] "Lupinus polyphyllus"             "Lysimachia terrestris"          
##  [65] "Maianthemum canadense"           "Maianthemum trifolium"          
##  [67] "Malus"                           "Melampyrum lineare"             
##  [69] "Monotropa uniflora"              "Morella pensylvanica"           
##  [71] "Muhlenbergia uniflora"           "Myrica gale"                    
##  [73] "Nuphar variegata"                "Oclemena nemoralis"             
##  [75] "Oclemena X blakei"               "Onoclea sensibilis"             
##  [77] "Osmunda regalis"                 "Osmundastrum cinnamomea"        
##  [79] "Phleum pratense"                 "Picea glauca"                   
##  [81] "Picea mariana"                   "Picea rubens"                   
##  [83] "Pinus banksiana"                 "Pinus strobus"                  
##  [85] "Pogonia ophioglossoides"         "Populus grandidentata"          
##  [87] "Populus tremuloides"             "Prenanthes"                     
##  [89] "Quercus rubra"                   "Ranunculus acris"               
##  [91] "Rhamnus frangula"                "Rhododendron canadense"         
##  [93] "Rhododendron groenlandicum"      "Rhynchospora alba"              
##  [95] "Rosa nitida"                     "Rosa palustris"                 
##  [97] "Rosa virginiana"                 "Rubus"                          
##  [99] "Rubus flagellaris"               "Rubus hispidus"                 
## [101] "Salix"                           "Salix petiolaris"               
## [103] "Sarracenia purpurea"             "Scirpus cyperinus"              
## [105] "Scutellaria"                     "Scutellaria lateriflora"        
## [107] "Solidago rugosa"                 "Solidago uliginosa"             
## [109] "Sorbus americana"                "Spiraea alba"                   
## [111] "Spiraea tomentosa"               "Symphyotrichum novi-belgii"     
## [113] "Symplocarpus foetidus"           "Thelypteris palustris"          
## [115] "Thuja occidentalis"              "Triadenum"                      
## [117] "Triadenum virginicum"            "Trichophorum cespitosum"        
## [119] "Trientalis borealis"             "Typha latifolia"                
## [121] "Utricularia cornuta"             "Vaccinium angustifolium"        
## [123] "Vaccinium corymbosum"            "Vaccinium macrocarpon"          
## [125] "Vaccinium myrtilloides"          "Vaccinium oxycoccos"            
## [127] "Vaccinium vitis-idaea"           "Veronica officinalis"           
## [129] "Viburnum nudum"                  "Viburnum nudum var. cassinoides"
## [131] "Vicia cracca"                    "Viola"                          
## [133] "Xyris montana"
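
Nested functions like sort(unique(...)) are read from the inside out: unique() runs first, then sort() runs on its result. The sketch below breaks that into separate steps on a short made-up vector (species), and adds length() to count the unique values.

```r
# Hypothetical vector with repeated species
species <- c("Picea rubens", "Acer rubrum", "Picea rubens", "Carex exilis", "Acer rubrum")

# Step 1: drop the repeats (order of first appearance is kept)
unique_species <- unique(species)
unique_species  # "Picea rubens" "Acer rubrum" "Carex exilis"

# Step 2: sort alphabetically
sorted_species <- sort(unique_species)
sorted_species  # "Acer rubrum" "Carex exilis" "Picea rubens"

# Count how many unique species there are
length(sorted_species)  # 3
```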

Data Exploration

Exploring data and fixing

We’ve already explored the wetland data a bit using head(), str(), and View(). These are functions that you will use over and over as you work with data in R. Below, I’m going to show how I get to know a data set in R.

Read in example NETN tree data from url

trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")

Look at first few records

head(trees)
##   Plot_Name ParkUnit PlotCode SampleDate IsQAQC SampleYear TagCode    TSN
## 1  MIMA-012     MIMA       12  6/16/2025  FALSE       2025      13 183385
## 2  MIMA-012     MIMA       12  6/16/2025  FALSE       2025      12  28728
## 3  MIMA-012     MIMA       12  6/16/2025  FALSE       2025      11  28728
## 4  MIMA-012     MIMA       12  6/16/2025  FALSE       2025       2  28728
## 5  MIMA-012     MIMA       12  6/16/2025  FALSE       2025      10  28728
## 6  MIMA-012     MIMA       12  6/16/2025  FALSE       2025       7  28728
##   ScientificName DBHcm TreeStatusCode CrownClassCode DecayClassCode
## 1  Pinus strobus  24.9             AS              5           <NA>
## 2    Acer rubrum  10.9             AB              5           <NA>
## 3    Acer rubrum  18.8             AS              3           <NA>
## 4    Acer rubrum  51.2             AS              3           <NA>
## 5    Acer rubrum  38.2             AS              3           <NA>
## 6    Acer rubrum  22.5             AS              4           <NA>

Look at summary of the columns

summary(trees)
##   Plot_Name           ParkUnit            PlotCode      SampleDate       
##  Length:164         Length:164         Min.   :11.00   Length:164        
##  Class :character   Class :character   1st Qu.:14.00   Class :character  
##  Mode  :character   Mode  :character   Median :16.50   Mode  :character  
##                                        Mean   :16.05                     
##                                        3rd Qu.:19.00                     
##                                        Max.   :20.00                     
##                                                                          
##    IsQAQC          SampleYear      TagCode          TSN        
##  Mode :logical   Min.   :2025   Min.   : 1.0   Min.   : 19049  
##  FALSE:164       1st Qu.:2025   1st Qu.: 7.0   1st Qu.: 24764  
##                  Median :2025   Median :12.5   Median : 28728  
##                  Mean   :2025   Mean   :13.6   Mean   : 62361  
##                  3rd Qu.:2025   3rd Qu.:19.0   3rd Qu.: 32929  
##                  Max.   :2025   Max.   :36.0   Max.   :565478  
##                                                                
##  ScientificName         DBHcm        TreeStatusCode     CrownClassCode 
##  Length:164         Min.   : 10.00   Length:164         Min.   :1.000  
##  Class :character   1st Qu.: 13.12   Class :character   1st Qu.:3.000  
##  Mode  :character   Median : 19.00   Mode  :character   Median :5.000  
##                     Mean   : 25.47                      Mean   :4.165  
##                     3rd Qu.: 28.45                      3rd Qu.:5.000  
##                     Max.   :443.00                      Max.   :6.000  
##                                                         NA's   :25     
##  DecayClassCode    
##  Length:164        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
There’s a lot to digest from the summary results.
  • We can see that Plot_Name and ParkUnit are treated as characters.
  • The range of plot numbers (PlotCode) is 11 to 20, and there are no blanks (NAs).
  • SampleDate is being interpreted as a character, not date. We’ll fix that.
  • IsQAQC is being treated as TRUE/FALSE. We’ll use that to filter out QAQC visits.
  • SampleYear is all 2025.
  • DBHcm ranges from 10 to 443.0, with no blanks (NAs).
  • CrownClassCode ranges from 1 to 6, with 25 blanks (NAs).
  • DecayClassCode is reading in as a character, not a number. We will look deeper into that next.
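
If you only want the column types and the number of blanks, without the rest of the summary, the sketch below shows one way to pull them out directly. It uses a small made-up data frame (toy) with a few columns named like those in trees so it runs on its own.

```r
# Hypothetical three-row stand-in for the trees data
toy <- data.frame(Plot_Name = c("MIMA-012", "MIMA-012", "MIMA-016"),
                  DBHcm = c(24.9, 10.9, 443),
                  CrownClassCode = c(5L, NA, 3L))

# Apply class() to every column to see how each was read in
col_classes <- sapply(toy, class)
col_classes

# is.na() flags blanks; colSums() counts them per column
na_counts <- colSums(is.na(toy))
na_counts
```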

Look at structure of each column

str(trees)
## 'data.frame':    164 obs. of  13 variables:
##  $ Plot_Name     : chr  "MIMA-012" "MIMA-012" "MIMA-012" "MIMA-012" ...
##  $ ParkUnit      : chr  "MIMA" "MIMA" "MIMA" "MIMA" ...
##  $ PlotCode      : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ SampleDate    : chr  "6/16/2025" "6/16/2025" "6/16/2025" "6/16/2025" ...
##  $ IsQAQC        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ SampleYear    : int  2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 ...
##  $ TagCode       : int  13 12 11 2 10 7 5 9 1 3 ...
##  $ TSN           : int  183385 28728 28728 28728 28728 28728 28728 28728 28728 28728 ...
##  $ ScientificName: chr  "Pinus strobus" "Acer rubrum" "Acer rubrum" "Acer rubrum" ...
##  $ DBHcm         : num  24.9 10.9 18.8 51.2 38.2 22.5 26.4 42.9 12.3 49 ...
##  $ TreeStatusCode: chr  "AS" "AB" "AS" "AS" ...
##  $ CrownClassCode: int  5 5 3 3 3 4 NA NA NA NA ...
##  $ DecayClassCode: chr  NA NA NA NA ...
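
The str() output confirms that SampleDate is stored as text in month/day/year order. As a sketch of one possible fix in base R (the training may use a different approach later), as.Date() can convert it if you describe the format. The vector sample_date_chr below is a made-up stand-in for trees$SampleDate.

```r
# Hypothetical stand-in for the SampleDate column
sample_date_chr <- c("6/16/2025", "6/17/2025")

# %m = month, %d = day, %Y = 4-digit year
sample_date <- as.Date(sample_date_chr, format = "%m/%d/%Y")
class(sample_date)  # "Date"
sample_date         # "2025-06-16" "2025-06-17"

# Once it is a Date, you can do date math
sample_date[2] - sample_date[1]  # Time difference of 1 days
```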

Look at unique values for DecayClassCode.

sort(unique(trees$DecayClassCode)) # sorts the unique values in the column
## [1] "1"  "2"  "3"  "PM"
table(trees$DecayClassCode) # shows the number of records per value - very handy
## 
##  1  2  3 PM 
##  9  6  8  2
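
Note that table() leaves out NAs by default, which is why the counts above add up to far fewer than 164 records. The sketch below, using a short made-up vector (decay), shows the useNA argument that includes them in the count.

```r
# Hypothetical stand-in for the DecayClassCode column
decay <- c("1", "2", NA, "PM", "1", NA, "3")

# Default: NAs are silently dropped from the counts
default_counts <- table(decay)
default_counts

# useNA = "ifany" adds an <NA> column when blanks are present
all_counts <- table(decay, useNA = "ifany")
all_counts

# Or count the blanks directly
sum(is.na(decay))  # 2
```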

There are 2 records called “PM”, which stands for Permanently Missing in our forest data. We will convert PM to a blank, which R calls NA, and create a new decay class column that is converted to numeric.

Convert “PM” to blank. I will first make a copy of the data frame.

trees2 <- trees
trees2$DecayClassCode[trees2$DecayClassCode == "PM"] <- NA
trees2$DecayClassCode_num <- as.numeric(trees2$DecayClassCode)

# check that it worked
str(trees2)
## 'data.frame':    164 obs. of  14 variables:
##  $ Plot_Name         : chr  "MIMA-012" "MIMA-012" "MIMA-012" "MIMA-012" ...
##  $ ParkUnit          : chr  "MIMA" "MIMA" "MIMA" "MIMA" ...
##  $ PlotCode          : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ SampleDate        : chr  "6/16/2025" "6/16/2025" "6/16/2025" "6/16/2025" ...
##  $ IsQAQC            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ SampleYear        : int  2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 ...
##  $ TagCode           : int  13 12 11 2 10 7 5 9 1 3 ...
##  $ TSN               : int  183385 28728 28728 28728 28728 28728 28728 28728 28728 28728 ...
##  $ ScientificName    : chr  "Pinus strobus" "Acer rubrum" "Acer rubrum" "Acer rubrum" ...
##  $ DBHcm             : num  24.9 10.9 18.8 51.2 38.2 22.5 26.4 42.9 12.3 49 ...
##  $ TreeStatusCode    : chr  "AS" "AB" "AS" "AS" ...
##  $ CrownClassCode    : int  5 5 3 3 3 4 NA NA NA NA ...
##  $ DecayClassCode    : chr  NA NA NA NA ...
##  $ DecayClassCode_num: num  NA NA NA NA NA NA 1 3 2 3 ...
sort(unique(trees2$DecayClassCode_num))
## [1] 1 2 3
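
You might wonder what happens if you skip the first step and convert the column straight to numeric. The sketch below, on a short made-up vector (decay), shows that R still turns "PM" into NA, but with a warning. Replacing "PM" yourself first makes the change intentional and documented in your code.

```r
# Hypothetical stand-in for the DecayClassCode column
decay <- c("1", "PM", "3", NA)

# Text that can't be read as a number becomes NA, with the warning
# "NAs introduced by coercion"
decay_num <- as.numeric(decay)
decay_num  # 1 NA 3 NA

# Replacing "PM" first gives the same result without the warning
decay_clean <- decay
decay_clean[decay_clean == "PM"] <- NA
decay_clean_num <- as.numeric(decay_clean)
decay_clean_num  # 1 NA 3 NA
```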

The other option would be to drop records with PM. Here we will use the base R subset() function. You first have to tell it which data frame you’re subsetting. Then you tell it the logic to use to subset. In this case, ! is interpreted in R as “NOT”, so DecayClassCode != "PM" means to keep all records where the decay code is not equal to PM. Be aware that subset() also drops records where DecayClassCode is NA, because NA != "PM" evaluates to NA rather than TRUE.

Remove records with “PM” as DecayClassCode

trees3 <- subset(trees, DecayClassCode != "PM")
# equivalent but not as easy to follow; which() skips the NA comparisons the way subset() does
trees3 <- trees[which(trees$DecayClassCode != "PM"), ]
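
Blanks need extra care when you filter. The sketch below uses a small made-up data frame (toy) to show how comparisons behave with NA, and one way to drop only the PM records while keeping the blanks. The names drop_pm_and_na and drop_pm_only are made up for this illustration.

```r
# Hypothetical stand-in for the trees data
toy <- data.frame(tag = 1:5,
                  decay = c("1", NA, "PM", "3", NA))

# Comparing NA to anything returns NA, not TRUE or FALSE
toy$decay != "PM"  # TRUE NA FALSE TRUE NA

# subset() treats NA as FALSE, so the blank rows are dropped along with PM
drop_pm_and_na <- subset(toy, decay != "PM")
nrow(drop_pm_and_na)  # 2

# To keep the blanks, say so explicitly with is.na()
drop_pm_only <- subset(toy, is.na(decay) | decay != "PM")
nrow(drop_pm_only)  # 4
```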
Basic plotting

Visualizing the data is also important to get a sense for the data and look for potential errors and outliers.

Histograms are a great start. The code below generates a basic histogram of a specific column in the data frame using the hist() function.

Plot histogram of DBH measurements

hist(x = trees$DBHcm)

Looking at the histogram, it looks like all of the measurements are below 100cm except for one that’s way out in the 400 range. You can also make a scatterplot of the data. If you only specify one column, the x axis will be the row number for each record, and the y axis will be the specified column.

Make point plot of DBH measurements

plot(trees$DBHcm)

Again, you can see there’s one value that’s greater than all of the others.

We can also plot two variables in a scatterplot.

Make scatterplot of crown class vs. DBH measurements

plot(trees$DBHcm ~ trees$CrownClassCode)

plot(DBHcm ~ CrownClassCode, data = trees) # equivalent but cleaner axis titles

Again, you can see there’s one value that’s greater than all of the others, and it’s crown class code 3 (codominant).

CHALLENGE: Using the skills you just learned, find the DBH record that’s > 400cm DBH.

Answer

There are multiple ways to do this. Two examples are below.

Option 1. View the data and sort by DBH.

View(trees)

Option 2. Find the max DBH value and subset the data frame

max_dbh <- max(trees$DBHcm, na.rm = TRUE)
trees[trees$DBHcm == max_dbh,]
##    Plot_Name ParkUnit PlotCode SampleDate IsQAQC SampleYear TagCode   TSN
## 26  MIMA-016     MIMA       16  6/17/2025  FALSE       2025       1 19447
##      ScientificName DBHcm TreeStatusCode CrownClassCode DecayClassCode
## 26 Quercus velutina   443             AS              3           <NA>

CHALLENGE: Using the skills you just learned, what is the value of the largest DBH, and which record does it belong to?

Answer

There are multiple ways to do this. Two examples are below.

Option 1. View the data and sort by DBH.

View(trees)

Option 2. Find the max DBH value and subset the data frame

max_dbh <- max(trees$DBHcm, na.rm = TRUE)
max_dbh #443
## [1] 443
trees[trees$DBHcm == max_dbh,]
##    Plot_Name ParkUnit PlotCode SampleDate IsQAQC SampleYear TagCode   TSN
## 26  MIMA-016     MIMA       16  6/17/2025  FALSE       2025       1 19447
##      ScientificName DBHcm TreeStatusCode CrownClassCode DecayClassCode
## 26 Quercus velutina   443             AS              3           <NA>
# Plot MIMA-016, TagCode = 1.
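
Two more ways to find that record are sketched below on a small made-up data frame (toy): filtering with a threshold, and using which.max() to get the row number of the largest value.

```r
# Hypothetical stand-in for the trees data
toy <- data.frame(TagCode = c(13, 12, 1),
                  DBHcm = c(24.9, 10.9, 443))

# Option 3. Keep rows where DBH is over a threshold
big_trees <- toy[toy$DBHcm > 400, ]
big_trees

# Option 4. which.max() returns the position of the largest value
biggest_row <- which.max(toy$DBHcm)
biggest_row  # 3
toy[biggest_row, ]
```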
Replace value based on pattern matching

Now let’s say that you looked at the datasheet, and the actual DBH for that tree was 44.3 instead of 443.0. You can change that value in the original CSV by hand. But even better is to document that change in code. I always create a new data frame when I modify the original data frame, so I can always refer back to the original while I’m coding. I also use a pretty specific filter to make sure I’m not accidentally changing other data.

Replace 443 with 44.3 in code

# create copy of trees data
trees_fix <- trees

# find the problematic DBH value, and change it to 44.3
trees_fix$DBHcm[trees_fix$Plot_Name == "MIMA-016" & trees_fix$TagCode == 1 & trees_fix$DBHcm == 443] <- 44.3
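
One way to be extra careful with a specific filter is to save it as its own object and count how many records it matches before replacing anything. This is only a sketch: `trees_demo`, `trees_demo_fix`, and `is_bad` are hypothetical names, and the data values are made up to resemble the `trees` data.

```r
# Hypothetical stand-in for the trees data; values are made up for illustration
trees_demo <- data.frame(
  Plot_Name = c("MIMA-016", "MIMA-016", "MIMA-020"),
  TagCode   = c(1, 2, 1),
  DBHcm     = c(443, 31.2, 12.8)
)

# Work on a copy so the original stays available for comparison
trees_demo_fix <- trees_demo

# Build the filter once, so the same condition is used to check and to replace
is_bad <- trees_demo_fix$Plot_Name == "MIMA-016" &
  trees_demo_fix$TagCode == 1 &
  trees_demo_fix$DBHcm == 443

# Count the matching records; stopifnot() halts the script with an error
# unless the filter matches exactly one record
sum(is_bad)
stopifnot(sum(is_bad) == 1)

# Replace the value only for the matching record
trees_demo_fix$DBHcm[is_bad] <- 44.3
trees_demo_fix
```

If the filter matched zero records (say, from a typo in the plot name), the replacement would silently do nothing, so the count is a cheap safeguard.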

CHALLENGE: How would you check that the line of code above worked?

Answer

There are multiple ways to do this. Three examples are below.

Option 1. Show the range of the original and fixed data frames

range(trees$DBHcm)
## [1]  10 443
range(trees_fix$DBHcm)
## [1] 10.0 81.5

Option 2. Plot a histogram of the original and fixed data frames

hist(trees$DBHcm)

hist(trees_fix$DBHcm) 

Option 3. Calculate max of DBHcm column

max(trees$DBHcm)
## [1] 443
max(trees_fix$DBHcm)
## [1] 81.5
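
The checks above confirm that the large value is gone. A further check, sketched below, is to compare the two data frames directly and confirm that only the intended record changed. The data frame names and values here are hypothetical stand-ins for `trees` and `trees_fix`, and the comparison assumes both have the same rows in the same order with no missing DBH values.

```r
# Hypothetical stand-ins for trees and trees_fix; values are made up
trees_demo <- data.frame(
  Plot_Name = c("MIMA-016", "MIMA-016", "MIMA-020"),
  TagCode   = c(1, 2, 1),
  DBHcm     = c(443, 31.2, 12.8)
)
trees_demo_fix <- trees_demo
trees_demo_fix$DBHcm[trees_demo_fix$Plot_Name == "MIMA-016" &
                       trees_demo_fix$TagCode == 1] <- 44.3

# Row positions where the original and fixed DBH values differ
changed <- which(trees_demo$DBHcm != trees_demo_fix$DBHcm)
changed

# Show the old and new values side by side for the changed records
data.frame(
  trees_demo[changed, c("Plot_Name", "TagCode")],
  DBH_orig = trees_demo$DBHcm[changed],
  DBH_fix  = trees_demo_fix$DBHcm[changed]
)
```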

Getting Help

Help Documentation

There are a number of options to get help with R. If you’re trying to figure out how to use a function, you can type ?function_name. For example ?plot will show the R documentation for that function in the Help panel.

Get help for the functions below

?plot
?dplyr::filter

You can also press F1 while the cursor is on a function name to open the help for that function. Help documents in R follow a standard layout to help you find what you’re looking for.
  • Top left shows the function with the {package}. Base means it’s a function in the base R install.
  • Description: tells you what the function does.
  • Usage: shows the function’s arguments and their default values. The “…” means there are other potential arguments, but isn’t something we need to talk about right now.
  • Arguments: defines each argument and the kind of input it takes. For example, whether an argument takes TRUE or FALSE, or a text string.
  • Value: describes what the function returns (not always included).
  • See Also: Sometimes functions build on other functions. This section links to similar or building block functions.
  • Examples: Functions that provide good examples are invaluable. Sometimes these are afterthoughts or not included in help documentation, which is too bad. Unfortunately, base R functions tend to have some of the most obscure, hard to understand examples.
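
A few other base R functions can get you to the same help content from the console. The sketch below shows some that I'd expect to be handy; all of them come with base R, and the functions being looked up (`sd`, `mean`, `filter` in the stats package) were picked only for illustration.

```r
# Print just the arguments of a function without opening the full help page
args(sd)

# help() is the long form of ?, and takes the function name as a string
help("sd")

# Specify the package when more than one package has a function with that name
help("filter", package = "stats")

# Run the code in the Examples section of a help page
example("mean")

# Search help pages by keyword when you don't know the function name
# (the same as typing ??histogram)
help.search("histogram")
```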


Troubleshooting errors

Great online resources for finding answers to questions include Stack Exchange and Stack Overflow. Google searches are usually my first step, and I include “in R” and the package name (if applicable) in every search related to R code. If you’re troubleshooting an error message, copying and pasting the error message verbatim into a search engine often helps.
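
If you want the exact text of an error message without retyping it, one option is to catch the error in code. This is a minimal sketch and goes a bit beyond what we cover in this training: `tryCatch()` and `conditionMessage()` are base R functions, `maen()` is a deliberate misspelling of `mean()`, and `err_msg` and `search_phrase` are hypothetical names.

```r
# Deliberately misspelled function name (maen instead of mean) to trigger an error.
# tryCatch() catches the error so the script keeps running, and
# conditionMessage() extracts the text of the message.
err_msg <- tryCatch(
  maen(c(1, 3, 5)),
  error = function(e) conditionMessage(e)
)

# Print the message as plain text, ready to paste into a search engine
cat(err_msg, "\n")

# Add "in R" (and the package name, if applicable) to build the search phrase
search_phrase <- paste(err_msg, "in R")
search_phrase
```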

Don’t hesitate to reach out to colleagues for help as well! If you are stuck on something and the answers on Google are more confusing than helpful, don’t be afraid to ask a human. Every experienced R programmer was a beginner once, so chances are they’ve encountered the same problem as you at some point. There is an R-focused Data Science Community of Practice for I&M folks, which anyone working in R (regardless of experience!) is invited and encouraged to join.

Common errors and how to fix them
  1. Unmatched parenthesis

    mean_x <- mean(c(1, 3, 5, 7, 8, 21) # missing closing parenthesis
    mean_x <- mean(c(1, 3, 5, 7, 8, 21)) # correct
  2. Unmatched quotes

    birds <- c("black-capped chickadee", "golden-crowned kinglet, "wood thrush") # missing quote after kinglet
    birds <- c("black-capped chickadee", "golden-crowned kinglet", "wood thrush") # corrected
  3. Missing a comma between elements

    birds <- c("black-capped chickadee", "golden-crowned kinglet" "wood thrush") # missing comma after kinglet
    birds <- c("black-capped chickadee", "golden-crowned kinglet", "wood thrush") # corrected
  4. Misspelled function name

    x_mean <- maen(x) # misspelled mean
    x_mean <- mean(x) # corrected
  5. Incorrect use of dimensions with brackets

    # Missing comma to indicate subsetting rows (records)
    ACAD_wetland2 <- ACAD_wetland[!is.na(ACAD_wetland$Site_Name)]
    ## Error in `[.data.frame`(ACAD_wetland, !is.na(ACAD_wetland$Site_Name)): undefined columns selected
    # Correct
    ACAD_wetland2 <- ACAD_wetland[!is.na(ACAD_wetland$Site_Name), ]
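
The bracket error is the one that tends to trip people up the most, so here is a small sketch of how the comma controls whether rows or columns are selected. The data frame `wetland_demo` is a hypothetical stand-in with made-up values, not the actual ACAD_wetland data.

```r
# Hypothetical stand-in for a wetland dataset; values are made up for illustration
wetland_demo <- data.frame(
  Site_Name = c("SEN-01", NA, "SEN-03"),
  Species   = c("Acer rubrum", "Carex stricta", "Picea mariana"),
  Ave_Cov   = c(0.02, 0.44, 2.5)
)

# [rows, ] with a comma: keep rows where Site_Name is not missing, all columns
rows_kept <- wetland_demo[!is.na(wetland_demo$Site_Name), ]
dim(rows_kept)

# [, columns] with a comma: keep all rows, only the named columns
cols_kept <- wetland_demo[, c("Site_Name", "Species")]
dim(cols_kept)

# [x] without a comma is treated as selecting columns, which is why a
# row filter without the trailing comma fails or returns the wrong thing
one_col <- wetland_demo["Species"]
dim(one_col)
```

Checking `dim()` or `nrow()` after a subset is a quick way to confirm you filtered the dimension you intended.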
Other resources that may help:

Keyboard Shortcuts

Once you get in the swing of coding, you’ll find that minimizing the number of times you have to use your mouse will help you code faster. RStudio has done a great job creating lots of really useful keyboard shortcuts designed to keep your hands on the keyboard instead of having to click through menus. One way to see all of the shortcuts RStudio has built in is to press Alt+Shift+K. A window should appear with a bunch of shortcuts listed. These are also listed on List of RStudio IDE Keyboard Shortcuts. The shortcuts I use the most often are listed below:
  • Undo: Ctrl Z
  • Redo: Ctrl Shift Z
  • Run highlighted code: Ctrl Enter
  • Insert “<-” : Alt -
  • Zoom in to make text bigger: Ctrl roll mouse forward (set in Global Options)
  • Zoom out: Ctrl - or Ctrl roll mouse backward (set in Global Options)
  • Move line of code up or down: Alt arrow up or down
  • Comment out whole line: Ctrl Shift C
  • Duplicate line of code: Ctrl Shift D
  • Move cursor to beginning of line: Home
  • Move cursor to end of line: End
  • View help for a given function: Put cursor on function name and press F1
  • Escape out of the command currently being executed in the console: Esc
  • Restart R Session: Ctrl Shift F10
  • Insert pipe (|>): Ctrl Shift M
  • View RStudio’s keyboard shortcuts: Alt Shift K

Day 2: Wrangling and Viz I

  • Data wrangling I:
    • Base R filter, subset, new column, rename column
    • dplyr: filter, select, mutate, rename, and pipe
    • Philosophy behind the tidyverse
      • Tidy data format - rows are observations, columns are variables.
      • All functions start with specifying the data first, making them pipeable.
    • dealing with NAs/0s
    • Dates/times in HOBO data
  • Data visualization philosophy
  • Intro to ggplot2
    • language of ggplot
    • basic plotting (default formatting)

Day 3: Wrangling and Viz II

  • Data wrangling II:
    • group_by and summarize
    • group_by and mutate
    • pivot_longer and pivot_wider
    • joining tables
  • ggplot II:
    • customizing plots
    • combining plots
    • custom color palettes
  • coding best practices (design with the user in mind)
    • comment code
    • parameters, packages, datasets on top
    • consistent code style
    • object naming

Resources

Online Resources

There’s a lot of great online material for learning new applications of R. The ones we’ve used the most are listed below.

Online Books

  • R for Data Science First author is Hadley Wickham, the programmer behind the tidyverse. There’s a lot of good stuff in here, including a chapter on using R Markdown, which is what we used to generate this training website.
  • ggplot2: Elegant Graphics for Data Analysis A great reference on ggplot2 also by Hadley Wickham.
  • Mastering Software Development in R First author is Roger Peng, a Biostatistics professor at Johns Hopkins, who has taught a lot of undergrad/grad students how to use R. He’s also one of the hosts of the Not So Standard Deviations podcast. His intro to ggplot is great. He’s also got a lot of more advanced topics in this book, like making functions and packages.
  • R Packages Another book by Hadley Wickham that teaches you how to build, debug, and test R packages.
  • Advanced R Yet another book by Hadley Wickham that helps you understand more about how R works under the hood, how it relates to other programming languages, and how to build packages.
  • Mastering Shiny And another Hadley Wickham book on building shiny apps.

Other useful sites

  • NPS_IMD_Data_Science_and_Visualization > Community of Practice is an IMD work group that meets once a month to talk about R and Data Science. There are also notes, materials and recordings from previous meetings, a Wiki with helpful tips, and the chat is a great place to post questions or cool tips you’ve come across.
  • STAT545 Jenny Bryan’s site that accompanies the graduate level stats class of the same name. She includes topics on best practices for coding, and not just how to make scripts work. It’s really well done.
  • RStudio home page There’s a ton of info in the Resources tab on this site, including cheatsheets for each package developed by RStudio (i.e., tidyverse packages), webinars, presentations from past RStudio Conferences, etc.
  • RStudio list of useful R packages by topic
  • Happy Git with R If you find yourself wanting to go down the path of hosting your code on GitHub, this site will walk you through the process of linking GitHub to RStudio.

Advanced topics

  • R Markdown (stand alone websites/docs) and R Shiny (interactive websites)
  • Writing your own functions and iteration

Code printout